Real-estate Transcation Behavior in Sacramento

by Anthony Terrence O’Brien Villate MD

========================================================

Abstract

In this analysis we explore dataon real estate transactions in 2016 over a five day period. The key goal of this study was to determine what were the main factors in the dataset which drove the price of the real estate transactions.

Introduction

SpatialKey is a private company which buids geospatial insturance technology that simplifies how insureers view risk by building catastrophe data, mapping, and analytics. As part of their training resources, Spatialkey provides a plethora of datasets with geospatial information, which can be downloaded from this url: https://support.spatialkey.com/spatialkey-sample-csv-data/

This.csv file is the Sacramento real estate transcations (SRET) file provided by SpatialKey. The file is a list of 985 real estate transactions in the Sacramento county reported over a five-day period in 2016, as reported by the Sacramento Bee.

The Sacramento Bee is the largest newspaper in Sacramento, the fifth largest newspaper in California, and the 27th largest paper in the USA.

Sacramento County is a county of the U.S State of California, and according to the 2010 census the population was 1,418,788. Also it contains the state capital of California, Sacramento, which has an estimated popualtion of 493,025; making it the 6th largest city in California.

Summary of the Data Set

The SRET dataset was chosen as it provides a rich range of datatypes, in a tidy format, with the option for practicing multiple editing, analytical and visual techniques.

The datset itself contains the following variables:

  1. street: The physical address of the property. Categorical variable, and it will be dropped due to redundancy.

  2. city: The city of the property within the Sacramento county. Categorical variable.

  3. zip: The USPS Zone Improvement Plan used in postal addresses in the USA. Categorical variable organized in a numeric format.

  4. state: The state in which the poperty is located. Categorical variable, which will be dropped as it does not provide useful information. All proerties are in the state of California

  5. beds: The number of beds the property has. Quantitative variable

  6. baths: The number of bathrooms the property has. Quantitative variable

  7. sq__ft: The square feet the property has. Quantitative variable which will be converted to square meters

  8. type: The type of property. Categorical variable

  9. sale_date: The day the property was sold. timedate variable which will be parsed into day

  10. price: The price at which the propery was sold. Quantitative variable

  11. latitude: The geospatial location based on latitude of the property. Geospatial data

  12. longnitude: The geospatial location based on longnitute of the property. Geospatial data

Please note, all referenced websites were viewed on April 18th 2018

Dataset exploration and transformation

First I review the dataset dimensions, variables and structure

The above dataset has a few variables which I want to drop or change. Particularly I want to remove street, stat and sale_datee, and convert sq__ft to square meters.

df <- subset(df, select = -c(street, state))# drops street and state variable
df['sq_m'] <- df$sq__ft*0.093# add a new column for data which is square meters
#test <-strsplit(as.character(df$sale_date), 'May') #parse data
#df['sale_day'] <-do.call(rbind, test) #add the day to the datafram
#df <- subset(df, select = -c(sale_date)) #drop the datetime
df <- subset(df, select = -c(sq__ft)) #drop the square feet

I review the dataset after the transformation.

Univariate Plots Section

Summary of city data

As a quick overview of the cities included in the analysis, I perform a quick, aggregate of the variable city using a table, and then by visualizing the data in a bar and pie chart

The pie chart code is derived from the following url http://www.sthda.com.

What stands out form this initial analysis is that there are 3 cities with an extraordinarly higher number of transactions relative to the other cities. These three cities are Sacramento, Elk Grove and Lincoln.

Summary of house types

From this point onwards I perform similar exploratory analysis and visualizations on the remainding variables.

From initial inspection it does not seem that the variable type will provide much insight into the data.

Summary of number of beds

summary(df$beds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   2.912   4.000   8.000
ggplot(df, aes(x=beds, fill=beds)) + 
  geom_bar()

There is a condierable number of missing data in beds, which can be later removed. Also, This data seems like it would provide useful information for understanding the transactional patterns in the data. Also at present this variable is an integer which should be converted to a factor variable to really appreciate its usefulness.

Summary of number of baths

summary(df$baths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.777   2.000   5.000
ggplot(df, aes(x=baths, fill=baths)) + 
  geom_bar()

The variable baths has similar properties as the variable beds.

Summary of sales price

summary(df$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1551  145000  213800  234100  300000  884800
ggplot(aes(x = price), data = df) +
  geom_histogram() +
  scale_x_continuous()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is one of the most important variable sin this dataset as we are studying property sales. Visually we note a high number of missing values and that the data is skewed to the right due to high sales numbers.

Summary of square meters for property

This variable shares similar proerties like the varibale price

Univariate Analysis

With this initial univariate analysis, there are interesting findings to report:

  1. There is a substantial amount of property located in Sacramento (n=434), followed by Elk Grove (n=114) and Lincoln (n=72).

  2. Most properties are reported as Residential.

  3. Most properties have 3 bedrooms and 2 bathrooms.

  4. The median cost of properties is $213,800, however there are also outlier above $750,000 with a max of $884,800.

  5. The median square meter space of the properties is 121.3 m^2, with the third quartile in 159.80m^2. Interestingly the max area reported is 541.4.

  6. In the square meter area for properties there is a significant amount of properties that report 0 as their area. The same for price.

Some initial questions which this analysis provokes are:

  1. It would be interesting to see if there are any particular properties of Sacramento, Elk Grove and Lincoln, relative to the available data, that can suggest why they are so popular as a sales area.

  2. Additionally it will be interesting to run similar analysis for properties which are considerably outside the normal ditribution of the data. For example, this can include data for properties which are exceptionally pricey or that have large surface area for sale.

  3. Likewise it will be interesting to see if there are any particular patterns for unreported data (eg. they may be from the same area, or for certain listings).

Bivariate Plots and Analysis Section

The variables ‘’latitude’ and ‘longitude’ contain geospatial data. This data can be used to create a scatterplot map of the property areas. The goespatial information can be used to visually observe sales patterns in the data. Potentially helping us to generate hypothesis to answer questions raised in the Univariate analysis section.

What is the geospatial relationship between sales transactions and location?

Leveraging the geospatial data, we tag its color using the city variable

With the above information we can overlay a map to observe how prevalent sales are relative to the city in which they were sold.

Therefore, this data is combined with map images pulled from google maps

##       left     bottom      right        top 
## -121.59941   38.20255 -120.54989   39.05977

This mapping procedure was derived from teh following url: http://eriqande.github.io/rep-res-web/lectures/making-maps-with-R.html

From this analysis we see that the three largest areas are representative of the following: 1) Sacramento is Californias capital city and therefore a main point of concentration for sales. 2) The next two main areas for sales (i.e. Elk Grove and Lincoln) are in the immediate vicinity of Sacramento (the south and North East, respectively). We also see that the further the areas are from Sacramento the less number of sales transcations occur.

Does the type of property affect the distribution of the sale?

##       left     bottom      right        top 
## -121.59941   38.20255 -120.54989   39.05977

Bsaed on the map image above it does not seem that the type of property affects the geospatial distribution of the sale. It seems that the type of property is not a useful variable for describing the data. Frpm this point on I will not include the type of property in the analysis.

Distribution matrix to find hidden relationships

From the distribution matrix we see that:

  1. The number of bedrooms and bathrooms a property has is highly correlated (i.e as the number of bedrooms increases the number of bathrroms increases, r = 0.84)

  2. Similarly, there is a weak positive relationship between the price of a property and the number of bedrooms and bathrooms (r = 0.34 and r = 0.42 respectively).

  3. Expectedly there is also a moderate posiitve relationshp between the square meter area and number of bedrooms (r=0.68) and bathrooms(r = 0.67)

  4. Intersetingly there is also a weak positive relationship between the price and the square meter area (r = 0.33).

The above data suprisingly suggests that the price of a property while related to its square meter area is not that substantially correlated to size.

After reviewing the distribution matrix, I was curious if the location may play more of a factor in determining sales prices rather than the number of bedrooms, bathrooms or size.

Converting integers to factors

Knowing that the above variables for beds and baths are set as an integer, I convert them to factor variables and plot them in relation to the size and price of the property sold.

df[,'beds_factor'] <- as.factor(df[,'beds']) #convert data to factor variables
df[,'baths_factor'] <- as.factor(df[,'baths'])

The above grid method layout was derived from the following URL: https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html

At this point I replace 0 with NA and remove from the plot

df[, 11:12][df[, 11:12] == 0] <- NA #convert 0 to NA

And then I replot the above graphs excluding the missing variables

The method of NA removal was obtained from the following URL: https://stackoverflow.com/questions/11403104/remove-unused-factor- levels-from-a-ggplot-bar-plot

By converting the number of bedrooms and the number of bathrooms to a factor variable, we unveil a relationship in which in the size per square meter and the price increase relative to the number of bathrooms and bedrooms.

Altough it is intersting to see that this relationship does not exactly hold for properties with 6 or more bedrooms or 5 bathrooms. Other aspects like location may have influenced this outcome for price. Though it is not clear why th size of location would decrease relative to the number of rooms.

Given the relationship of bedroom and bathroom with price and size, plot thevariable beds and baths relative to the geospatial data.

##       left     bottom      right        top 
## -121.59941   38.20255 -120.54989   39.05977

In general it is difficult to discern any patterns, other than areas further away from Sacramento (South, South-East, North and North-East) have a higher number of bedrooms relative to Sacramento.

##       left     bottom      right        top 
## -121.59941   38.20255 -120.54989   39.05977

Like number of bedrooms, there is a distribution which seems to lead to increased number of bathrooms the furhter away one mooves from Sacramento.

Price range according to geographic location

##       left     bottom      right        top 
## -121.59941   38.20255 -120.54989   39.05977

The above map suggests that tmore expensive areas are found outside of the Sacramento area, and more towards the north east around Granite Bay, Folsom, El Dorado Hills, and Cameron Park. While a couple of more expensive sales also occured in the bottom right corner of the map around the Wilton area.

Location and size

##       left     bottom      right        top 
## -121.59941   38.20255 -120.54989   39.05977

While not as prominent a feature, as one goes out of Sacramento area the square meter area per property tends to increase.

Price and size

Prior analysis suggests that there is a linear relationship between size and property. Therefore to study this relationship I perform a scatterplot.

In general there is a visual relationship betwen price and size. However this relationship seems to hold until around $375,000.

To confirm this observation I plot a linera regression and lowess plot, on top of the scatterplot.

From here, I zoom in on to the data and then readjust the plot to redraw the regression lines as if the data after $375,000 was not considered

We see that when we consider the parsed data alone that there is a strong linear relationship between size and price.

At this point I hypothesize that the relationship between price and size is relative to the properties location. Similarly, as number of bedrooms and bath rooms relate to size, I plotted a similar scatter Plot

We see that most of the pricing data is determined by Sacramento, and that anything under or equall to the median number of bedrooms and bathrooms is generally agregated within the area represented by Sacramento.

Multivariate Plots and Analysis Section

Now that I have visually analyzed the data, I will run a few univariate linear regression models separately to determine which significant relationships cna be added to a multiple linear regression model.

Regresion model

## 
## Call:
## lm(formula = I(price) ~ I(city), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272000  -75200  -18028   47264  501264 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              232496      18896  12.304  < 2e-16 ***
## I(city)AUBURN            173394      52091   3.329 0.000906 ***
## I(city)CAMERON PARK       35448      40819   0.868 0.385383    
## I(city)CARMICHAEL         63188      30760   2.054 0.040224 *  
## I(city)CITRUS HEIGHTS    -45382      26338  -1.723 0.085204 .  
## I(city)COOL               67504     110179   0.613 0.540239    
## I(city)DIAMOND SPRINGS   -16463     110179  -0.149 0.881250    
## I(city)EL DORADO          14504      79045   0.183 0.854457    
## I(city)EL DORADO HILLS   259203      29484   8.791  < 2e-16 ***
## I(city)ELK GROVE          38661      21457   1.802 0.071891 .  
## I(city)ELVERTA           -99630      57468  -1.734 0.083305 .  
## I(city)FAIR OAKS          71004      40819   1.739 0.082273 .  
## I(city)FOLSOM            182464      32406   5.631 2.37e-08 ***
## I(city)FORESTHILL        -37678     110179  -0.342 0.732445    
## I(city)GALT                4447      30300   0.147 0.883348    
## I(city)GARDEN VALLEY     257504     110179   2.337 0.019639 *  
## I(city)GOLD RIVER        125504      57468   2.184 0.029217 *  
## I(city)GRANITE BAY       446237      65456   6.817 1.65e-11 ***
## I(city)GREENWOOD         162504     110179   1.475 0.140569    
## I(city)LINCOLN          -135957      22819  -5.958 3.59e-09 ***
## I(city)LOOMIS            334504      79045   4.232 2.54e-05 ***
## I(city)MATHER              5304     110179   0.048 0.961618    
## I(city)MEADOW VISTA       -2496     110179  -0.023 0.981928    
## I(city)NORTH HIGHLANDS   -96837      30300  -3.196 0.001440 ** 
## I(city)ORANGEVALE         46663      37791   1.235 0.217223    
## I(city)PENRYN            274192     110179   2.489 0.012995 *  
## I(city)PLACERVILLE       131367      39183   3.353 0.000832 ***
## I(city)POLLOCK PINES       7806      65456   0.119 0.905095    
## I(city)RANCHO CORDOVA     30910      27890   1.108 0.268023    
## I(city)RANCHO MURIETA     65254      65456   0.997 0.319063    
## I(city)RIO LINDA         -59769      35544  -1.682 0.092987 .  
## I(city)ROCKLIN           149339      32406   4.608 4.61e-06 ***
## I(city)ROSEVILLE          92032      24546   3.749 0.000188 ***
## I(city)SACRAMENTO        -34760      19593  -1.774 0.076361 .  
## I(city)SHINGLE SPRINGS    42504     110179   0.386 0.699754    
## I(city)SLOUGHHOUSE      -230496     110179  -2.092 0.036702 *  
## I(city)WALNUT GROVE      147504     110179   1.339 0.180968    
## I(city)WEST SACRAMENTO   -61796      65456  -0.944 0.345363    
## I(city)WILTON            385012      52091   7.391 3.20e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108500 on 946 degrees of freedom
## Multiple R-squared:  0.4083, Adjusted R-squared:  0.3846 
## F-statistic: 17.18 on 38 and 946 DF,  p-value: < 2.2e-16
## 
## Calls:
## m1: lm(formula = I(price) ~ I(city), data = df)
## 
## ============================================================
##   (Intercept)                                232496.394***  
##                                              (18895.476)    
##   I(city): AUBURN/ANTELOPE                   173394.406***  
##                                              (52091.231)    
##   I(city): CAMERON PARK/ANTELOPE              35448.051     
##                                              (40818.893)    
##   I(city): CARMICHAEL/ANTELOPE                63188.356*    
##                                              (30759.606)    
##   I(city): CITRUS HEIGHTS/ANTELOPE           -45381.480     
##                                              (26337.725)    
##   I(city): COOL/ANTELOPE                      67503.606     
##                                             (110178.609)    
##   I(city): DIAMOND SPRINGS/ANTELOPE          -16463.394     
##                                             (110178.609)    
##   I(city): EL DORADO/ANTELOPE                 14503.606     
##                                              (79045.446)    
##   I(city): EL DORADO/ANTELOPE HILLS          259202.563***  
##                                              (29484.103)    
##   I(city): ELK GROVE/ANTELOPE                 38661.299     
##                                              (21456.756)    
##   I(city): ELVERTA/ANTELOPE                  -99630.394     
##                                              (57468.346)    
##   I(city): FAIR OAKS/ANTELOPE                 71004.273     
##                                              (40818.893)    
##   I(city): FOLSOM/ANTELOPE                   182463.783***  
##                                              (32405.473)    
##   I(city): FORESTHILL/ANTELOPE               -37678.394     
##                                             (110178.609)    
##   I(city): GALT/ANTELOPE                       4447.035     
##                                              (30300.170)    
##   I(city): GARDEN VALLEY/ANTELOPE            257503.606*    
##                                             (110178.609)    
##   I(city): GOLD RIVER/ANTELOPE               125503.606*    
##                                              (57468.346)    
##   I(city): GRANITE BAY/ANTELOPE              446236.939***  
##                                              (65455.848)    
##   I(city): GREENWOOD/ANTELOPE                162503.606     
##                                             (110178.609)    
##   I(city): LINCOLN/ANTELOPE                 -135957.130***  
##                                              (22818.455)    
##   I(city): LOOMIS/ANTELOPE                   334503.606***  
##                                              (79045.446)    
##   I(city): MATHER/ANTELOPE                     5303.606     
##                                             (110178.609)    
##   I(city): MEADOW VISTA/ANTELOPE              -2496.394     
##                                             (110178.609)    
##   I(city): NORTH HIGHLANDS/ANTELOPE          -96837.061**   
##                                              (30300.170)    
##   I(city): ORANGEVALE/ANTELOPE                46663.152     
##                                              (37790.951)    
##   I(city): PENRYN/ANTELOPE                   274191.606*    
##                                             (110178.609)    
##   I(city): PLACERVILLE/ANTELOPE              131367.006***  
##                                              (39182.492)    
##   I(city): POLLOCK PINES/ANTELOPE              7806.273     
##                                              (65455.848)    
##   I(city): RANCHO CORDOVA/ANTELOPE            30909.535     
##                                              (27889.693)    
##   I(city): RANCHO MURIETA/ANTELOPE            65253.606     
##                                              (65455.848)    
##   I(city): RIO LINDA/ANTELOPE                -59768.779     
##                                              (35543.899)    
##   I(city): ROCKLIN/ANTELOPE                  149339.430***  
##                                              (32405.473)    
##   I(city): ROSEVILLE/ANTELOPE                 92031.856***  
##                                              (24545.943)    
##   I(city): SACRAMENTO/ANTELOPE               -34760.405     
##                                              (19592.803)    
##   I(city): SHINGLE SPRINGS/ANTELOPE           42503.606     
##                                             (110178.609)    
##   I(city): SLOUGHHOUSE/ANTELOPE             -230496.394*    
##                                             (110178.609)    
##   I(city): WALNUT GROVE/ANTELOPE             147503.606     
##                                             (110178.609)    
##   I(city): WEST SACRAMENTO/ANTELOPE          -61796.394     
##                                              (65455.848)    
##   I(city): WILTON/ANTELOPE                   385012.006***  
##                                              (52091.231)    
## ------------------------------------------------------------
##   R-squared                                       0.408     
##   adj. R-squared                                  0.385     
##   sigma                                      108546.243     
##   F                                              17.182     
##   p                                               0.000     
##   Log-likelihood                             -12798.766     
##   Deviance                           11146043479543.883     
##   AIC                                         25677.531     
##   BIC                                         25873.237     
##   N                                             985         
## ============================================================
## 
## Call:
## lm(formula = I(price) ~ I(beds_factor), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -278363  -75613  -22260   46740  601530 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       109646      36503   3.004  0.00274 ** 
## I(beds_factor)2    55782      37850   1.474  0.14091    
## I(beds_factor)3   105967      36942   2.868  0.00422 ** 
## I(beds_factor)4   173614      37203   4.667 3.54e-06 ***
## I(beds_factor)5   289895      39475   7.344 4.78e-13 ***
## I(beds_factor)6   320321      75986   4.215 2.75e-05 ***
## I(beds_factor)8   172754     121066   1.427  0.15396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115400 on 870 degrees of freedom
##   (108 observations deleted due to missingness)
## Multiple R-squared:  0.219,  Adjusted R-squared:  0.2136 
## F-statistic: 40.65 on 6 and 870 DF,  p-value: < 2.2e-16
## 
## Calls:
## m1: lm(formula = I(price) ~ I(beds_factor), data = df)
## 
## ==============================================
##   (Intercept)                  109646.200**   
##                                (36502.668)    
##   I(beds_factor): 2/1           55782.101     
##                                (37850.081)    
##   I(beds_factor): 3/1          105966.715**   
##                                (36941.946)    
##   I(beds_factor): 4/1          173613.893***  
##                                (37203.359)    
##   I(beds_factor): 5/1          289895.342***  
##                                (39475.092)    
##   I(beds_factor): 6/1          320320.467***  
##                                (75986.363)    
##   I(beds_factor): 8/1          172753.800     
##                               (121065.654)    
## ----------------------------------------------
##   R-squared                         0.219     
##   adj. R-squared                    0.214     
##   sigma                        115431.572     
##   F                                40.649     
##   p                                 0.000     
##   Log-likelihood               -11463.587     
##   Deviance             11592269584516.680     
##   AIC                           22943.174     
##   BIC                           22981.386     
##   N                               877         
## ==============================================
## 
## Call:
## lm(formula = I(price) ~ I(baths_factor), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -378801  -70352  -20599   52475  504438 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        159025       8047  19.762  < 2e-16 ***
## I(baths_factor)2    67474       9283   7.268 8.09e-13 ***
## I(baths_factor)3   221328      12372  17.890  < 2e-16 ***
## I(baths_factor)4   199741      26043   7.670 4.60e-14 ***
## I(baths_factor)5   595975      76764   7.764 2.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108000 on 872 degrees of freedom
##   (108 observations deleted due to missingness)
## Multiple R-squared:  0.3152, Adjusted R-squared:  0.3121 
## F-statistic: 100.3 on 4 and 872 DF,  p-value: < 2.2e-16
## 
## Calls:
## m1: lm(formula = I(price) ~ I(baths_factor), data = df)
## 
## ===============================================
##   (Intercept)                   159024.667***  
##                                  (8047.057)    
##   I(baths_factor): 2/1           67474.309***  
##                                  (9283.396)    
##   I(baths_factor): 3/1          221327.477***  
##                                 (12371.643)    
##   I(baths_factor): 4/1          199741.070***  
##                                 (26042.747)    
##   I(baths_factor): 5/1          595975.333***  
##                                 (76764.028)    
## -----------------------------------------------
##   R-squared                          0.315     
##   adj. R-squared                     0.312     
##   sigma                         107962.594     
##   F                                100.337     
##   p                                  0.000     
##   Log-likelihood                -11405.929     
##   Deviance              10163963662608.637     
##   AIC                            22823.857     
##   BIC                            22852.516     
##   N                                877         
## ===============================================

From the initial univariate analysis, we see that city relates best to the transaction listings, followed by number of bedrooms, then bathrooms, and finally square meters.

Using this insight the multivariate model will be created accordingly.

## 
## Calls:
## m1: lm(formula = I(price) ~ I(city), data = df)
## m2: lm(formula = I(price) ~ I(city) + beds_factor, data = df)
## m3: lm(formula = I(price) ~ I(city) + beds_factor + baths_factor, 
##     data = df)
## m4: lm(formula = I(price) ~ I(city) + beds_factor + baths_factor + 
##     sq_m, data = df)
## 
## ==============================================================================================================================
##                                               m1                     m2                    m3                    m4           
## ------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                232496.394***         124614.829***         113118.076***          68909.091*    
##                                              (18895.476)           (34347.647)           (32183.782)           (30906.892)    
##   I(city): AUBURN/ANTELOPE                   173394.406***         192972.896***         148385.426***         199816.315***  
##                                              (52091.231)           (45437.249)           (42741.242)           (40938.671)    
##   I(city): CAMERON PARK/ANTELOPE              35448.051             87090.987*            97113.738**          163656.195***  
##                                              (40818.893)           (37318.735)           (34941.173)           (33898.800)    
##   I(city): CARMICHAEL/ANTELOPE                63188.356*            81857.599**           94526.504***          91455.726***  
##                                              (30759.606)           (26799.473)           (25121.226)           (23855.654)    
##   I(city): CITRUS HEIGHTS/ANTELOPE           -45381.480            -36366.535            -23167.962             -7836.600     
##                                              (26337.725)           (22925.978)           (21549.351)           (20524.504)    
##   I(city): COOL/ANTELOPE                      67503.606             90555.327             97534.517            100249.768     
##                                             (110178.609)           (95891.333)           (89706.308)           (85179.796)    
##   I(city): DIAMOND SPRINGS/ANTELOPE          -16463.394              6588.327             13567.517             25269.831     
##                                             (110178.609)           (95891.333)           (89706.308)           (85188.102)    
##   I(city): EL DORADO/ANTELOPE                 14503.606             59672.417             73434.229             69652.137     
##                                              (79045.446)           (68878.246)           (64480.978)           (61228.253)    
##   I(city): EL DORADO/ANTELOPE HILLS          259202.563***         208972.941***         188971.604***         287260.182***  
##                                              (29484.103)           (26986.879)           (25383.023)           (26199.489)    
##   I(city): ELK GROVE/ANTELOPE                 38661.299             24816.562             28178.670             19026.113     
##                                              (21456.756)           (18861.814)           (17651.782)           (16788.262)    
##   I(city): ELVERTA/ANTELOPE                  -99630.394            -89807.987            -74842.787            -55989.823     
##                                              (57468.346)           (50019.442)           (46824.851)           (44505.488)    
##   I(city): FAIR OAKS/ANTELOPE                 71004.273             88435.662*            63439.741             52768.257     
##                                              (40818.893)           (35568.771)           (34292.673)           (32581.201)    
##   I(city): FOLSOM/ANTELOPE                   182463.783***         162608.628***         157366.106***         139638.627***  
##                                              (32405.473)           (29451.061)           (27564.277)           (26238.742)    
##   I(city): FORESTHILL/ANTELOPE               -37678.394                                                                       
##                                             (110178.609)                                                                      
##   I(city): GALT/ANTELOPE                       4447.035             -6523.542              4224.833              8326.591     
##                                              (30300.170)           (26783.758)           (25137.937)           (23873.212)    
##   I(city): GARDEN VALLEY/ANTELOPE            257503.606*           280555.327**          287534.517**          373651.999***  
##                                             (110178.609)           (95891.333)           (89706.308)           (85653.423)    
##   I(city): GOLD RIVER/ANTELOPE               125503.606*           164994.301**          179094.705***         140680.788**   
##                                              (57468.346)           (56996.901)           (53363.501)           (50829.309)    
##   I(city): GRANITE BAY/ANTELOPE              446236.939***         358948.679***         341118.095***         311888.570***  
##                                              (65455.848)           (57486.169)           (53917.648)           (51287.764)    
##   I(city): GREENWOOD/ANTELOPE                162503.606            185555.327            192534.517*           115740.023     
##                                             (110178.609)           (95891.333)           (89706.308)           (85556.543)    
##   I(city): LINCOLN/ANTELOPE                 -135957.130***           8004.227              7671.676             83588.350***  
##                                              (22818.455)           (26400.694)           (24717.040)           (24774.341)    
##   I(city): LOOMIS/ANTELOPE                   334503.606***         304638.071***         280971.416***         323481.452***  
##                                              (79045.446)           (68961.750)           (65533.039)           (62384.322)    
##   I(city): MATHER/ANTELOPE                     5303.606             28355.327             35334.517              1643.578     
##                                             (110178.609)           (95891.333)           (89706.308)           (85252.057)    
##   I(city): MEADOW VISTA/ANTELOPE              -2496.394             20555.327             51680.160             55436.558     
##                                             (110178.609)           (95891.333)           (90105.359)           (85559.138)    
##   I(city): NORTH HIGHLANDS/ANTELOPE          -96837.061**          -77132.168**          -56249.069*           -40658.848     
##                                              (30300.170)           (26416.622)           (24970.623)           (23766.402)    
##   I(city): ORANGEVALE/ANTELOPE                46663.152             49682.863             63230.354*            62987.711*    
##                                              (37790.951)           (32914.154)           (30965.738)           (29403.079)    
##   I(city): PENRYN/ANTELOPE                   274191.606*           297243.327**          304222.517***         314779.982***  
##                                             (110178.609)           (95891.333)           (89706.308)           (85186.468)    
##   I(city): PLACERVILLE/ANTELOPE              131367.006***         152123.817***         148994.522***         189026.637***  
##                                              (39182.492)           (35528.714)           (33242.944)           (31841.375)    
##   I(city): POLLOCK PINES/ANTELOPE              7806.273             45602.720             57103.659             58101.419     
##                                              (65455.848)           (57016.519)           (53363.690)           (50670.825)    
##   I(city): RANCHO CORDOVA/ANTELOPE            30909.535             -9665.643             -8903.569             -8482.638     
##                                              (27889.693)           (26403.635)           (24791.596)           (23540.542)    
##   I(city): RANCHO MURIETA/ANTELOPE            65253.606            135388.071*           124217.867             69674.480     
##                                              (65455.848)           (68961.750)           (64534.055)           (61541.907)    
##   I(city): RIO LINDA/ANTELOPE                -59768.779            -22584.087             -4877.443              6359.794     
##                                              (35543.899)           (31887.974)           (29931.531)           (28445.302)    
##   I(city): ROCKLIN/ANTELOPE                  149339.430***         133267.190***         135829.423***         174516.705***  
##                                              (32405.473)           (30947.404)           (28949.821)           (27784.601)    
##   I(city): ROSEVILLE/ANTELOPE                 92031.856***          83211.553***          83998.253***         110526.580***  
##                                              (24545.943)           (22217.362)           (20801.039)           (19944.935)    
##   I(city): SACRAMENTO/ANTELOPE               -34760.405            -24914.927            -11384.512             -6118.953     
##                                              (19592.803)           (17114.899)           (16168.603)           (15362.521)    
##   I(city): SHINGLE SPRINGS/ANTELOPE           42503.606                                                                       
##                                             (110178.609)                                                                      
##   I(city): SLOUGHHOUSE/ANTELOPE             -230496.394*          -207444.673*          -325645.246***        -521444.198***  
##                                             (110178.609)           (95891.333)           (92914.346)           (90567.159)    
##   I(city): WALNUT GROVE/ANTELOPE             147503.606            170555.327            201680.160*           176185.672*    
##                                             (110178.609)           (95891.333)           (90105.359)           (85599.709)    
##   I(city): WEST SACRAMENTO/ANTELOPE          -61796.394            -18210.494              2333.941             10543.814     
##                                              (65455.848)           (96202.503)           (90145.364)           (85600.523)    
##   I(city): WILTON/ANTELOPE                   385012.006***         339685.841***         353104.416***         272613.956***  
##                                              (52091.231)           (50292.768)           (47073.435)           (45482.474)    
##   beds_factor: 2/1                                                  40595.665             31547.983             16944.781     
##                                                                    (31077.196)           (29297.723)           (27861.055)    
##   beds_factor: 3/1                                                  84829.843**           65201.765*            36047.548     
##                                                                    (30353.000)           (29219.995)           (27912.199)    
##   beds_factor: 4/1                                                 137747.100***          86174.984**           41812.893     
##                                                                    (30559.967)           (29972.835)           (28835.382)    
##   beds_factor: 5/1                                                 223881.188***         107159.498**           44493.010     
##                                                                    (32627.881)           (32949.484)           (31964.750)    
##   beds_factor: 6/1                                                 235726.979***           3610.240            -34646.955     
##                                                                    (62620.331)           (64775.715)           (61636.641)    
##   beds_factor: 8/1                                                 182700.098             31341.031            -72371.911     
##                                                                    (99052.713)           (96153.312)           (91942.041)    
##   baths_factor: 2/1                                                                       24145.642**           11391.362     
##                                                                                           (9038.637)            (8685.387)    
##   baths_factor: 3/1                                                                      124332.503***          84097.026***  
##                                                                                          (14063.672)           (14000.282)    
##   baths_factor: 4/1                                                                      149325.406***          85222.090***  
##                                                                                          (25574.416)           (25190.876)    
##   baths_factor: 5/1                                                                      460291.383***         429697.673***  
##                                                                                          (73110.206)           (69494.318)    
##   sq_m                                                                                                            615.510***  
##                                                                                                                   (64.323)    
## ------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                       0.408                 0.499                 0.564                 0.607     
##   adj. R-squared                                  0.385                 0.474                 0.540                 0.585     
##   sigma                                      108546.243             94422.256             88321.792             83864.677     
##   F                                              17.182                19.779                23.318                27.261     
##   p                                               0.000                 0.000                 0.000                 0.000     
##   Log-likelihood                             -12798.766            -11268.866            -11208.183            -11162.241     
##   Deviance                           11146043479543.883     7435579141776.336     6474613369753.001     5830592498359.043     
##   AIC                                         25677.531             22625.732             22512.366             22422.482     
##   BIC                                         25873.237             22835.898             22741.638             22656.531     
##   N                                             985                   877                   877                   877         
## ==============================================================================================================================

From the initial univariate regression analysis, we see that the most expensive area is Granite Ville. However upon considering square meter, number of bedrooms and bathrooms Garden Valley ends up having the most expensive proeprties, then Granite Ville.

Visualization of sales

From the above graph we note that the sales density is largest towards Sacramento and immediately adjascent areas. However, sales prices increase the further out the property is relative to Sacramento.

This is interesting because generally in large cities in the Americas, property prices tend to increaes the closer one is to the main city, for example this is observed in San Francisco, Boston, New York, and even cities outside the USA like Bogota, Colombia.

The pricing patterns are paradoxical to what one would traditionally expect.


Final Plots and Summary

Tip: You’ve done a lot of exploration and have built up an understanding of the structure of and relationships between the variables in your dataset. Here, you will select three plots from all of your previous exploration to present here as a summary of some of your most interesting findings. Make sure that you have refined your selected plots for good titling, axis labels (with units), and good aesthetic choices (e.g. color, transparency). After each plot, make sure you justify why you chose each plot by describing what it shows.

Plot One

Description One

What we observe when plotting property prices relative to the location is that Sacramento city plays a huge role in stabilizing pricing property. As the property gets further away from Sacramento, prices begin to become less linearly related to the property’s size.

Plot Two

Description Two

This plot helps to confirm insight provided by plot one, in which we observe that prices tend to rise the further away one gets from Sacramento.

Plot Three

```

Description Three

In the third visualization, we note acknowledge there are other factors which can influence the price of a transaction other than location and size, such as the number of bedrooms and bathrooms.

In fact, we also note this in our multi-linear regression model, and point out how these two co-variates impact the cost of the most expensive property in this sample (see multi-linear regression section).


Reflection

I was interested in analyzing this particular set of data, because it contained an appropriate number of geolocalization information to construct geospatial maps, while still allowing me to apply the different techniques learnt during the exploratory data analysis course. In addition to this, I am currently searching for property to rent and I thought it woudl be useful to leverage this opportunity to gain insight into a real life scenario.

A lot of the coding was straight forward, however I did have to try multiple different tools/techniques before being able to graph the map relative to the available data. I woudl say anything relating map data to the current data was a bit of a challenge, however geospatial maps were acomplished. Another challenge I ran into was with parsing the date-time data. In the end, I dropped the data, however in my initial attempt, I was able to separate the day from the rest of the date-time data. Ulteimately the data was not useful so I dropped it.

I learnt to transform values in a column like 0 into NaN or NA, and recently learnt that NA is a non-numeric value, while NaN is a numeric value.

Ultimately while I see the power of R to compute big data in mere seconds, and perform complex visualizations. I still prefer python. I’m also quite satisfied with the linear regression model generation. It was suprising to me that location was more indicative of pricing when compared to over variables like size. I guess the old saying “location, location, location” really makes an impact afterall. In the future this data, can be combined with a yearly account to really try and grasp deeper tendencies (afterall this data only represents sales in a 5 day period). Additionally, other variables like family size would be interesting to integrate. Finally I would recommend that the creators of the dataset reorganize the variable type, because at present it is not very informative.